Abstract. In this paper we argue that contextual multi-armed bandit
algorithms could open avenues for designing self-learning security modules
for computer networks and related tasks. The paper has two contributions:
a conceptual one and an algorithmic one. The conceptual contribution is to
formulate, as an example, the real-world problem of preventing SPIT (spam in
VoIP networks), which is currently not satisfactorily addressed by standard
techniques, as a sequential learning problem, namely as a contextual
multi-armed bandit. Our second contribution is to present CMABFAS, a new
algorithm for general contextual multi-armed bandit learning that
specifically targets domains with finite actions. We illustrate how CMABFAS
could be used to design a fully self-learning SPIT filter that does not rely
on feedback from the end-user (i.e., does not require labeled data) and
report first simulation results.

1 Introduction

SPIT is an acronym for spam in internet telephony and refers to unsolicited calls
that, when answered by a human, deliver a pre-recorded message (e.g.,
advertisements or phishing attempts). Similar to spam in emails, SPIT exploits
the openness of the existing infrastructure (e.g., no strongly authenticated
identities) together with the fact that VoIP calls can easily be generated
automatically and at zero (or very low) cost. Unlike spam in emails, however,
where the content consists of text and can be analyzed before it is delivered,
the content of a phone call (a voice stream) is only available once the call is
answered. Thus many of the defensive measures that are effective against email
spam do not directly translate to SPIT mitigation. Some first ideas have already
been suggested to address this problem. They range from reputation-based [1115]
and call-frequency based [21] dynamic black-listing and fingerprinting [26], to
challenging suspicious calls with captchas [20,23,18], to the use of standard
machine learning such as anomaly detection [16,14], clustering [25], or decision
trees [TS]. We believe that with respect to SPIT prevention these earlier
solutions suffer from one or both of the following two shortcomings: (1) they are
built on weak "features" (i.e., information from the SIP protocol header, which in
essence consists of text strings produced by the VoIP client) that are fairly easy
to manipulate for a sophisticated hacker; (2) they are built as a static defense
from labeled training data (e.g., signatures of known attacks), require constant
manual adjustment by a human domain expert, and are thus vulnerable to novel
attacks.

In this paper we explore a novel paradigm in machine learning, namely rein- 
forcement learning, to attack this problem from a new angle. Specifically, we use 
contextual multi-armed bandits to design a self-learning SPIT filter which dy- 
namically selects from among several security policies (i.e., voice captchas and 
computational puzzles) the most appropriate one and which does not rely on 
explicit feedback from the end-user (i.e., labeled training data) — instead, the 
SPIT filter monitors its own performance and generates internal rewards from 
subsequent traffic. 

The paper is structured as follows. We begin in Section 2 by describing
the philosophy behind the design of our SPIT filter and by motivating the use of
contextual bandits to implement it in the real world. Section 3
introduces our algorithm CMABFAS, a variant of a contextual MAB for learning 
in finite action spaces over generalized metric spaces, which will have exactly the 
properties we need for our SPIT filter. Note that our description of CMABFAS in 
this section will be kept general (such that it can be applied to other problems as 
well); in Section 4 we then describe in detail how we can map the SPIT prevention 
problem from Section 2 to CMABFAS. Section 5 then presents some simulation 
results and compares CMABFAS with a more naive baseline implementation of 
MABs. 

2 Background and related work 

Before we can start, we feel it necessary to give a fair warning to the reader. At 
the present time, VoIP telephony has not (yet) replaced traditional telephony 
and the problem of SPIT is largely a hypothetical one. In particular, there is 
no publicly available dataset¹ and little experience of what SPIT will look like
in the real world. To the best of our knowledge, the "empirical evaluation" of 
all earlier research on SPIT prevention is therefore based on guesswork and 
simulation. In this paper, we will face the same situation; however, our method 
also has to interact with the calling party - which is even harder to simulate 
realistically and cannot be done from a static dataset. Our method will therefore 
also be evaluated in only a simplified testbed where the behavior of SPIT bots 
is "emulated" by distributions the parameters of which are synthetically chosen 
by hand. 

¹ The earlier work described in [17] set out to change precisely that. In it the authors
describe a methodology for creating SPIT traffic and also provide a common data
set for use in benchmark comparisons. However, the data set they provide is
generated from "emulated users based on a social model"; in essence, the authors
use common tools to generate the SPIT traffic, where the relevant features, such as
call duration, inter-arrival time, behavior upon receiving a call, etc., are all modeled
by sampling from distributions. For example, the call duration was generated from
an exponential distribution whose parameter was specified by hand (which
amounts to the same as what we do here).
2.1 Related work: Designing a SPIT filter

We consider the inbound scenario, where the SPIT filter is located close to the 
recipient of the call (e.g., at the VoIP proxy) and decisions must be made on a 
per-call basis without having explicit history information on a per-source basis 
(thus ruling out reputation-based methods). For every incoming call the system 
has to decide whether it is a regular call or a SPIT call using only features 
directly extracted from the request, i.e., text strings extracted from the fields 
of the Session Initiation Protocol (SIP) header in an INVITE message. Current 
hardware phones are identifiable through the specific SIP header they produce, 
and it is conceivable that SPIT bots could be equally identifiable through the 
specific SIP header they produce. Nevertheless we believe that, from this
information alone, traditional static techniques are not sufficient to build a strong
filter that detects and blocks SPIT with high accuracy. There are two reasons for this:
(1) any such signature-based defense will require a human expert to manually 
identify and add signatures of SPIT bots to a list of known attacks which is a 
costly procedure and leaves the system vulnerable to novel attacks; and (2) the 
information in the protocol header is weak in the sense that it can be easily ma- 
nipulated by a sophisticated hacker to make a SPIT call appear as if originating 
from a regular device. 

An interesting idea suggested in [20,7], and on which we are going to build,
is for the defensive system to collect additional information that would be a lot
harder for SPIT to manipulate: so-called Turing tests or voice CAPTCHAs that
actively interact with the calling party. For example, before forwarding a
call, an automated mechanism could prompt a suspicious caller to dial a short
sequence of randomly generated digits. Both the reaction of the caller to the test
(a bot is not likely to obey telephone etiquette and would immediately start to
play back its pre-recorded message) and the result of the test itself (only
a very sophisticated bot will be able to break a voice CAPTCHA) will reveal
additional information about whether the caller is a human or a SPIT bot.
A large number of these security challenges, most of which are parameterized
to generate an infinite variety, already exist today; however, deciding which of
these security challenges to apply given the features of a call is currently
done by a human expert (e.g., see NEC's SEAL [M]). This decision is not trivial.
On the one hand, applying a challenge will reveal additional information about the
call being SPIT or not. On the other hand, applying it also carries certain costs,
namely: (1) annoying the calling party; (2) additional computational resources;
and (3) loss of obfuscation, meaning that we would prefer to avoid exposing all
capabilities of our defense system so that attackers cannot start to learn from
them. The essential point here is that, while it would result in the fewest
mistakes, we cannot afford to apply our strongest but likely most "costly"
security challenge to every single call.

Based on this design for a SPIT filter which can choose from many possible
security challenges or actions (where we include "apply no security challenge" as
just another action), our goal is to create a self-learning SPIT filter which does
not rely on hand-coded rules but automatically determines from past experience
what the "best" security challenge should be for a given call. This self-learning
does not rely on external feedback; instead the system monitors its own
performance and generates internal rewards. Moreover, this self-learning also
ensures that the system will adapt to new variants of SPIT as part of its normal
operation.

A sketch of the basic interaction loop between caller and SPIT filter is shown
in the figure below. Calls are processed sequentially (individually, one by one).
Every time a call arrives at the SPIT filter, the SPIT filter selects one action
and applies it to the call. Each action forces a response from the caller (e.g.,
passing/failing a security challenge or, if the call is made, features from
the call such as call duration, amount of double talk, etc.). This response
is analyzed by the internal reward generator by first inferring whether or not
the caller is a SPIT bot. Then, depending on the outcome of this inference
stage (which may be a probability for the call being SPIT), the nature of the
action chosen (if it is likely SPIT, did we choose an action that tried to prevent
it?), and the cost of the action, a scalar reward is returned to the SPIT filter.
From this reward the SPIT filter updates its internal (call, action)-scores and
proceeds with the next call in the queue.

In summary, to implement this design for a SPIT filter, we have to address 
two issues: (1) how to implement the reward generator; (2) how to implement 
the action selection and learning part. Note that in this paper we focus on the 
latter. 
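
The interaction loop described above can be summarized in a few lines of code.
The following is only a structural sketch under stated assumptions: the function
name `run_filter`, the four callables, and the toy stand-ins in the usage part are
all hypothetical and not part of the design itself; in particular, the toy reward
rule is a placeholder, since the reward generator is not specified here.

```python
from typing import Any, Callable, Iterable, List, Tuple

def run_filter(
    calls: Iterable[Any],
    select_action: Callable[[Any], Any],
    apply_action: Callable[[Any, Any], Any],
    reward_generator: Callable[[Any, Any, Any], float],
    update: Callable[[Any, Any, float], None],
) -> List[Tuple[Any, float]]:
    """Process calls one by one and return the (action, reward) log."""
    log = []
    for call in calls:
        action = select_action(call)                        # filter picks an action
        response = apply_action(call, action)               # caller is forced to react
        reward = reward_generator(call, action, response)   # internal scalar reward
        update(call, action, reward)                        # update (call, action)-scores
        log.append((action, reward))
    return log

# --- toy stand-ins (hypothetical, for illustration only) ---
ACTIONS = ["no_challenge", "challenge"]
scores = {}  # (call, action) -> accumulated reward

def select_action(call):
    # Purely greedy choice on the accumulated scores (no exploration).
    return max(ACTIONS, key=lambda a: scores.get((call, a), 0.0))

def apply_action(call, action):
    # Placeholder response: the caller named "bot" fails any challenge.
    return {"passed": not (action == "challenge" and call == "bot")}

def reward_generator(call, action, response):
    # Placeholder rule: reward 1 only when a challenge was failed.
    return 1.0 if not response["passed"] else 0.0

def update(call, action, reward):
    scores[(call, action)] = scores.get((call, action), 0.0) + reward

log = run_filter(["alice", "bot", "bot"], select_action, apply_action,
                 reward_generator, update)
```

With these stand-ins the greedy rule never tries the challenge, because all
scores tie at zero and the first action always wins the tie. This is exactly the
lack of exploration that the bandit formulation is meant to address.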



[Figure: interaction loop between caller and SPIT filter. Labels: SPIT filter
(VoIP proxy), choose security challenge, internal reward generator, reward for
(x,a).]



2.2 Background: Multi-armed bandits 

To implement learning, we formalize our SPIT filter as a multi-armed bandit 
problem (MAB) with context. Standard MABs are well-studied models for se- 
quential decision-making when the outcome is stochastic and its distribution a 
priori unknown. In a standard MAB we assume we get to play the following 
"game" over multiple rounds: suppose we are given n different choices or actions 
and each action is associated with a stochastic reward function (that stays the 
same for all rounds we play but may be different for each action). In every round
of the game we have to choose one of the n possible actions, and in doing so we 
obtain a random reward sampled from the underlying distribution. Our goal is 
to choose actions such that the sum of rewards we obtain is maximized. 

Naturally, the best strategy would be to always choose the action yielding the
highest expected reward. However, the reward distributions are not revealed to
the player and thus it is (initially) unknown which action will produce the highest
reward. To solve this problem, we have to form an estimate for each action about
what reward we might get, based on the results we have obtained in earlier
rounds of the game. Of course, the more often we have tried a particular action
in the past, the more certain and reliable this estimate will be. The fundamental
dilemma is now how to balance exploitation (choosing what currently appears
to be the best action) and exploration (choosing a non-greedy action to improve
our estimate and potentially obtain higher rewards in the future). The standard
MAB problem with a finite (and small) number of actions is largely considered to
be a solved problem, and provably optimal strategies exist in the literature [12,2].
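
As an illustration of how exploration and exploitation can be balanced in the
standard (context-free) setting, the following minimal sketch implements the
well-known UCB1 index policy on Bernoulli arms. It is not taken from this paper:
the arm means, the horizon, the seed, and the function name `ucb1` are
assumptions chosen only for the example.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Play `horizon` rounds of UCB1 on Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # how often each arm was played
    sums = [0.0] * n_arms    # sum of rewards observed per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1      # play every arm once to initialise the estimates
        else:
            # index = empirical mean (exploitation) + confidence bonus (exploration)
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Hypothetical three-armed bandit; the last arm has the highest mean.
counts = ucb1([0.2, 0.5, 0.8], horizon=5000)
```

The bonus term shrinks as an arm is played more often, so rarely tried arms are
revisited from time to time while most plays go to the arm that currently looks
best.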

A contextual MAB is in principle like a standard MAB with the major differ- 
ence that it is defined over some large set of elements or a continuous space. Now 
each element of the set corresponds to a separate standard MAB (in some formu- 
lations also the action space becomes large or continuous). Learning the reward 
distributions in contextual MABs is more challenging than it is for standard 
MAB since it is no longer possible to sample the same action multiple times. In- 
stead, we have to impose "smoothness" as additional structural assumption; i.e., 
we assume that elements of the context space that are "similar" (with respect 
to some similarity measure) will also behave similarly. Using generalization we 
can then try to predict the outcome for new cases based on previously observed 
outcomes for similar cases. Contextual MABs are nowadays an active research
topic with many relevant real-world applications, e.g., placement of web
advertisements. See [24,13,19,22,10,8] for some examples.

We believe that contextual MABs (but not standard MABs) are a good 
description of what the SPIT filter motivated in the previous section is trying 
to achieve: the contexts correspond to calls (represented by SIP headers), the 
actions correspond to security challenges the filter can choose from, and the 
rewards correspond to how the calling party reacts. 

3 Description of CMABFAS 

This section describes our algorithm CMABFAS for contextual MAB in finite 
action spaces. Note that we keep the presentation general; the actual application
to the SPIT prevention scenario is described in Section 4. Our
work is largely based on the contextual zooming algorithm described in [22], and
inspired by the X-armed bandit learning algorithm described in [4]. Differences
between CMABFAS and [22] are: a specialization to the finite action case and a 
modified scheme to estimate the expected rewards which works for more general 
metric spaces (we do not need the triangle inequality). 

3.1 Notation 

We begin by introducing some notation. Let $X$ denote the context space with
elements $x \in X$ and let $a \in \{1, \ldots, k\}$ denote the possible actions that can be
chosen for each $x$ (we assume that we have the same choice of actions available
for each $x$). Each context $x$ can be seen as an index to a conventional $k$-armed
bandit: for each $x$ we have a distribution of rewards $\mathcal{R}_a(x)$ under action $a$ which
models the stochastic response from the environment when performing action $a$
in context $x$. Let $r_a(x)$ denote the random values drawn from the corresponding
reward distribution, i.e., $r_a(x) \sim \mathcal{R}_a(x)$. We assume that $\mathcal{R}_a(x)$ has bounded
support which, for notational convenience, is supposed to lie in the unit interval;
thus we assume $\operatorname{supp} \mathcal{R}_a(x) \subset [0, 1]$. In consequence, we also have $r_a(x) \in [0, 1]$.
Let $\mu_a(x)$ denote the mean of the reward distribution $\mathcal{R}_a(x)$; i.e.,

$$\mu_a(x) := \mathbb{E}_{r_a(x) \sim \mathcal{R}_a(x)} \bigl[ r_a(x) \bigr]. \tag{1}$$

The essence of the problem we study is that $\mathcal{R}_a(x)$, and hence $\mu_a(x)$, is not
known when making decisions; instead it is treated as a black box from which
only samples can be drawn. Our overall goal is, when presented with any context
$x$, to be able to choose the action with the highest mean: $\arg\max_a \mu_a(x)$. The
algorithm we propose is based on taking and averaging samples in a smart way
such that observations we made at one location $x$ are reused to make an estimate
about the reward distribution at another location $x'$.

Let context space $X$ be equipped with a distance metric $d(x, x')$ which
measures the distance between any two elements $x, x' \in X$. We assume that $d$ is a
pseudo-metric and fulfills the conditions of (1) non-negativity: $d(x, x') \ge 0$; (2)
symmetry: $d(x, x') = d(x', x)$; and (3) is equal to zero if and only if the
arguments are identical: $d(x, x') = 0 \Leftrightarrow x = x'$, $\forall x, x' \in X$. Furthermore we assume
that, for notational convenience again, $d$ is scaled such that the diameter of $X$
with respect to $d$ is equal to one; that is, $\sup_{x, x' \in X} d(x, x') = 1$.

Finally, to make learning and generalization over the context space at all
possible, we need to impose smoothness and restrict the variation of the mean
functions $\mu_a$. Specifically, we assume that each $\mu_a$ is Lipschitz with a modulus
of variation $\lambda > 0$:

$$|\mu_a(x) - \mu_a(x')| \le \lambda \, d(x, x'), \quad \forall x, x' \in X, \ \forall a. \tag{2}$$

3.2 Objective 

The goal of contextual MAB is to find a strategy for the following "game". The
game proceeds in rounds $t = 1, 2, \ldots$. At each round $t$ we observe $x_t \in X$; we
suppose that we have no control over how $x_t$ is generated from the set $X$ and
furthermore that the mechanics leading to the selection of an $x_t$ are independent
of whatever happened in all previous rounds of the game. Given $x_t$, we have
to choose an action $a_t \in \{1, \ldots, k\}$. Executing action $a_t$ lets us then observe the
reward $r_{a_t}(x_t)$, which is a random sample from $\mathcal{R}_{a_t}(x_t)$. Our goal is to use the
results of the $t - 1$ previous rounds of the game, i.e., the history

$$\bigl(x_1, a_1, r_{a_1}(x_1), \ \ldots, \ x_{t-1}, a_{t-1}, r_{a_{t-1}}(x_{t-1})\bigr),$$

to determine an action $a_t$ such that the regret, a measure of performance, is
minimized. In the bandit literature two types of regret are considered: here we
take the cumulative regret, which assumes that we have to play the game for a
fixed number $T$ of rounds and that we want to minimize, summed over all $T$ rounds,
the difference between the expected reward of the best possible action and the
expected reward of the action chosen at that round:

$$\sum_{t=1}^{T} \bigl[ \mu^*(x_t) - \mu_{a_t}(x_t) \bigr], \tag{3}$$

where $\mu^*(x_t) = \max_{a \in \{1, \ldots, k\}} \mu_a(x_t)$.
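
To make the definition concrete, the sketch below evaluates the cumulative
regret of a sequence of choices when the mean functions are known, which is only
possible in simulation. The two mean functions, the contexts, and the function
name `cumulative_regret` are invented for the example; actions are indexed from
0 in the code rather than from 1.

```python
def cumulative_regret(contexts, actions, means):
    """Sum over rounds of (best mean at x_t) - (mean of the chosen action at x_t)."""
    total = 0.0
    for x, a in zip(contexts, actions):
        best = max(mu(x) for mu in means)   # mu^*(x_t)
        total += best - means[a](x)         # per-round regret
    return total

# Hypothetical mean functions on X = [0, 1]: action 0 is better for x > 1/2.
means = [lambda x: x, lambda x: 1.0 - x]
contexts = [0.1, 0.4, 0.9]

regret_fixed = cumulative_regret(contexts, [0, 0, 0], means)    # always action 0
regret_optimal = cumulative_regret(contexts, [1, 1, 0], means)  # best action per context
```

Always playing action 0 incurs regret at the first two contexts, where action 1
has the higher mean, while the context-dependent choice incurs none.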

3.3 Illustration 

Figure 1 depicts the whole situation graphically. In it, we chose to draw $X$
as a straight line, suggesting that it is a continuous and compact space (however,
in theory it can equally well be a discrete or finite space). The mean reward $\mu_a(x)$
is drawn as a continuous function over $X$; the figure shows two mean functions
$\mu_1(x)$ and $\mu_2(x)$ corresponding to two possible actions $a \in \{1, 2\}$. For each
location $x$ and action $a$ we have a separate reward distribution $\mathcal{R}_a(x)$ from
which random samples $r_a(x)$ are gathered.

The goal is to find for each $x$ the action with the highest expected reward,
which in our illustration is the curve that lies above the other curve. As the
figure shows, in general we will not have the situation that one and the same
action is optimal everywhere. Instead, because of the smoothness² assumptions
we made for $\mu_a$, there will be "regions" where one action is optimal, and regions
where another one is optimal.

Note that the figure is somewhat misleading in the way the random samples
are shown; from Figure 1 (top) it appears as if multiple samples from the same
distribution (i.e., same location and same action) can be gathered. This, however,
is exactly the situation we do not have (it would correspond to the traditional
multi-armed bandit scenario). A more realistic representation of the situation
we face is thus Figure 1 (bottom); it shows how the samples are spread out over
different locations and actions and motivates why it becomes necessary at all to
average and generalize over the context space $X$.

² Note that in our mathematical formulation of the problem smoothness is only
imposed over the means of the distribution. The actual form of the distribution
(such as being concentrated around the mean, being multi-modal, etc.) could vary
from location to location and thus also impact the practical performance.

Fig. 1. Contextual multi-armed bandit with two actions (blue and red) over
context space $X$ (in our case $X$ will be the space of SIP headers). Each
point/location $x \in X$ is associated with an action-dependent reward distribution,
the mean of which is denoted by the blue and red curve. Top: sample rewards
are observed at locations $x'$ and $x''$, with the shaded area denoting the underlying
distribution. Bottom: a more realistic illustration; in practice, multiple reward
samples are rarely obtained at the same location. Instead they are spread out
and need to be aggregated in a judicious way to estimate the expected reward
at a new query location $x_t$.

3.4 CMABFAS High-level overview

Our algorithm CMABFAS works as follows. For each action $a$ separately, we
incrementally construct over time $t = 1, 2, \ldots$ a cover of the context space $X$.
The cover consists of ball-shaped regions where individual balls are centered on
a certain subset of the contexts $\{x_1, \ldots, x_{t-1}\}$ seen so far. The cover is
hierarchical with the radius of the balls decreasing exponentially fast with the
level of hierarchy; e.g., $X$ is covered at level 1 by a single ball of radius 1, at
level 2 by balls of radius 1/2, at level 3 by balls of radius 1/4, and so on (see
Figure 2). Each ball aggregates the reward samples lying within: we only store
their number and their sum. Each ball covering $x_t$ can thus be used to estimate
the expected reward $\mu_a(x_t)$ at query point $x_t$. Since we have to balance exploration
and exploitation, we will augment this estimate by a UCB-like term (i.e., an
upper confidence bound). Each ball covering $x_t$ thus gives rise to a score which
is composed of two parts: (1) the sample average within the ball, and (2) an
uncertainty term which depends on the number of samples (the fewer samples
we have in a ball, the less certain we can be about the correctness of their
average) and the volume of the ball (the larger the region the ball covers, the more
variation of $\mu_a$ is possible and thus the more samples of a different base quantity
are lumped together). The "best ball" for each action is the one with the lowest
score (the tightest upper bound), and among them the highest score indicates
the best action. The cover is adaptively refined by adding new balls according
to the following rules: (1) a new ball can only be created centered at $(x_t, a_t)$;
(2) a new ball can only be created if the number of samples in the parent ball
exceeds a certain threshold (which grows inverse quadratically in the radius of
the parent ball); (3) a new ball is only created if it will not overlap with already
existing balls at the same level in the hierarchy. Informally speaking, CMABFAS
works by ensuring that high resolutions are only attained in regions of $X$ where
the corresponding action is optimal and then exploiting that balls with
increasingly smaller radius provide increasingly more accurate bounds for the
expected reward.

Fig. 2. Adaptively covering the context space with ball-shaped regions (see text).
Panels: Level 1 (radius = 1), Level 2 (radius = 1/2), Level 3 (radius = 1/4);
x marks the sample locations in the context space.

From a practical point of view our algorithm possesses two properties which
make it ideally suited for operation under heavy-load real-world conditions: (1)
it is an online algorithm whose computational complexity and storage
requirements are very low (this is so because we do not have to store and operate
on an ever-growing number of individual data points, but only have to store and
operate on balls of a fixed radius, which is a much smaller number³ and where
search operations can be efficiently implemented by appropriate space
partitioning methods such as cover trees [3]); (2) it is an anytime algorithm which
aims at producing the best possible solution in each step of its operation and does
not need any kind of prior learning phase with data or tuning to start producing
meaningful results.

³ The number of balls of radius $r$ is trivially upper bounded by the $r$-packing number
of $X$. A tighter bound can be achieved by the near-optimality dimension or
related concepts (e.g., see [4,22]), which also take into account the specifics of the
actual problem, i.e., $\mu_a$.
3.5 CMABFAS Notation

Before we come to the details of the algorithm, we need to introduce some more
notation. Let $B_i^a \in \{1, \ldots, n_B^a\}$ be the index of the $i$-th ball and $n_B^a$ the total
number of balls in the cover for action $a$. Let $x(B_i^a) \in X$ denote the location
of ball $B_i^a$, that is, the location of its center in $X$, and let $r(B_i^a) \in [0, 1]$ denote
its radius. We say that an element $x \in X$ lies in ball $B_i^a$, written as $x \in B_i^a$, if
$d(x, x(B_i^a)) \le r(B_i^a)$. Overloading the notation, we can identify with $B_i^a$ also the
region $B_i^a = \{x \in X \mid d(x, x(B_i^a)) \le r(B_i^a)\} \subset X$. Let $n_t(B_i^a)$ denote the number
of all the samples gathered up to time $t$ lying in $B_i^a$, and let $Q_t(B_i^a)$ denote their
corresponding sum of rewards. Let

$$\begin{aligned}
\mathrm{avg}_t(B_i^a) &:= Q_t(B_i^a) / n_t(B_i^a), \\
\mathrm{conf}_t(B_i^a) &:= c \cdot \sqrt{\log T / n_t(B_i^a)}, \\
\mathrm{size}(B_i^a) &:= 2 \lambda \, r(B_i^a),
\end{aligned}$$

where $c$ is a constant (which in practice will become a tunable parameter of
the algorithm). We say that ball $B_i^a$ is full (able to spawn a child) whenever
$\mathrm{conf}_t(B_i^a) \le r(B_i^a)$.
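
The quantities just defined are simple enough to be written down directly. The
sketch below does so and uses them to compute how many samples a ball needs
before it is full. The function names, the default values $c = 1$ and
$\lambda = 1$, and the convention that an empty ball has infinite uncertainty are
assumptions made for this illustration.

```python
import math

def conf(n, horizon, c=1.0):
    """Uncertainty term c * sqrt(log T / n); infinite for an empty ball (assumption)."""
    return math.inf if n == 0 else c * math.sqrt(math.log(horizon) / n)

def size(radius, lam=1.0):
    """Variation term 2 * lambda * r."""
    return 2.0 * lam * radius

def is_full(n, radius, horizon, c=1.0):
    """A ball is full once its uncertainty has dropped to its radius."""
    return conf(n, horizon, c) <= radius

def samples_until_full(radius, horizon, c=1.0):
    """Smallest n with conf <= radius, i.e. n >= c^2 * log(T) / r^2."""
    return math.ceil(c * c * math.log(horizon) / radius ** 2)

# Radii of the first three levels of the hierarchy.
thresholds = [samples_until_full(r, horizon=1000) for r in (1.0, 0.5, 0.25)]
```

With $T = 1000$ and $c = 1$ the thresholds are 7, 28, and 111 samples, so halving
the radius roughly quadruples the number of samples a ball must collect before
it may spawn a child.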



3.6 CMABFAS Algorithm 

Initialization: At time $t = 0$ we initialize the individual cover for each action
with a single ball: we create a ball centered on an arbitrary element $x \in X$ and
set its radius to 1 (such that it covers the whole space):

$$\forall a = 1, \ldots, k: \ \text{create } B_1^a \text{ with } \ x(B_1^a) := \text{any element of } X, \quad r(B_1^a) := 1, \quad n_0(B_1^a) := 0, \quad Q_0(B_1^a) := 0.$$

Every step: Now suppose that at time $t$ the context $x_t$ arrives. For each action $a$
separately, we first compute the indices of those balls that contain $x_t$, which we
will call active balls for $x_t$: $A_a(x_t) := \{B_i^a \mid x_t \in B_i^a\}$. From the set of active
balls we then compute the set of relevant balls, which consists of all balls
$B_i^a \in A_a(x_t)$ which are either not full or allow the creation of a child (with radius
$\tfrac{1}{2} r(B_i^a)$) centered on $x_t$ such that it does not overlap (distance at least
$\tfrac{1}{2} r(B_i^a)$) with an already existing ball at this level of the hierarchy:

$$R_a(x_t) := \bigl\{ B_i^a \in A_a(x_t) \ \bigm|\ \mathrm{conf}_t(B_i^a) > r(B_i^a) \ \lor \ \nexists\, B_j^a \in A_a(x_t) : r(B_j^a) = \tfrac{1}{2} r(B_i^a) \bigr\}.$$

For each ball in $R_a(x_t)$ we then compute its current score, which is a
high-probability upper bound⁴ for the error we make when we use the current in-ball
sample average as a proxy for the true but unknown expected reward $\mu_a(x_t)$. We
take the minimum over all the upper bounds as score $u(x_t, a)$ for the action in
question (i.e., the tightest upper bound):

$$u(x_t, a) := \min_{B_i^a \in R_a(x_t)} \bigl[ \mathrm{avg}_t(B_i^a) + \mathrm{conf}_t(B_i^a) + \mathrm{size}(B_i^a) \bigr]. \tag{4}$$

⁴ The argument goes as follows. Let $x_t$ be the current context and take any active ball
$B_i^a \in A_a(x_t)$. Let $S_n$ be the set of indices of previous samples lying in the ball and
$|S_n| = n$ be their number. Applying Azuma-Hoeffding for martingales with bounded

Having such a $u$-score for each possible action $a$, we then choose the action which
achieves the highest $u$-score:

$$a_t^* := \arg\max_{a = 1, \ldots, k} u(x_t, a). \tag{5}$$

The system executes action $a_t^*$ and observes the stochastic outcome $r_{a_t^*}(x_t)$, which
is drawn from the unknown distribution $\mathcal{R}_{a_t^*}(x_t)$. We then use this new
observation to update all the balls that were active for the action chosen:

$$\begin{aligned}
\forall B_i^{a_t^*} \in A_{a_t^*}(x_t): \quad & Q_{t+1}(B_i^{a_t^*}) = Q_t(B_i^{a_t^*}) + r_{a_t^*}(x_t), \\
& n_{t+1}(B_i^{a_t^*}) = n_t(B_i^{a_t^*}) + 1; \\
\text{all remaining balls:} \quad & Q_{t+1}(B_i^a) = Q_t(B_i^a), \\
& n_{t+1}(B_i^a) = n_t(B_i^a).
\end{aligned}$$

Adaptive refinement: Having determined $a_t^*$ and before updating, we test if
the ball $B^*$ achieving the minimum in (4) for action $a_t^*$ is full (and thus allowed
to spawn a child). If it is full, we add a new ball $B^*_{\mathrm{child}}$ with center $x_t$ and radius
$\tfrac{1}{2} r(B^*)$ and add its index to the list of active balls.
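
The steps of this section can be put together in a compact sketch. It is one
possible reading of the description above, not a reference implementation: the
class and method names, the defaults $c = 1$ and $\lambda = 1$, the treatment of
empty balls (infinite uncertainty), tie-breaking in favor of the lower action
index, and the toy environment in `demo` (contexts in $[0, 1]$ with distance
$|x - x'|$ and two invented mean functions) are all assumptions.

```python
import math
import random

class Ball:
    """A ball of the cover: center, radius, sample count n and reward sum q."""
    def __init__(self, center, radius):
        self.center, self.radius = center, radius
        self.n, self.q = 0, 0.0

class CMABFAS:
    def __init__(self, n_actions, dist, horizon, x0, c=1.0, lam=1.0):
        self.dist, self.horizon, self.c, self.lam = dist, horizon, c, lam
        # Initialization: one radius-1 ball per action, centered on x0.
        self.covers = [[Ball(x0, 1.0)] for _ in range(n_actions)]

    def conf(self, b):
        if b.n == 0:
            return math.inf   # assumption: empty ball has infinite uncertainty
        return self.c * math.sqrt(math.log(self.horizon) / b.n)

    def score(self, b):
        avg = b.q / b.n if b.n > 0 else 0.0
        return avg + self.conf(b) + 2.0 * self.lam * b.radius

    def active(self, a, x):
        return [b for b in self.covers[a] if self.dist(x, b.center) <= b.radius]

    def relevant(self, a, x):
        act = self.active(a, x)
        # Keep balls that are not full, or that have no active ball at child level.
        return [b for b in act
                if self.conf(b) > b.radius
                or not any(o.radius == b.radius / 2 for o in act)]

    def select(self, x):
        """Return (action, ball) following Eqs. (4) and (5)."""
        best = None
        for a in range(len(self.covers)):
            ball = min(self.relevant(a, x), key=self.score)   # Eq. (4)
            if best is None or self.score(ball) > best[0]:    # Eq. (5)
                best = (self.score(ball), a, ball)
        return best[1], best[2]

    def update(self, x, a, ball, reward):
        # Refinement is tested before the update, so a newly spawned child
        # (centered on x) also receives the current sample.
        if self.conf(ball) <= ball.radius:
            self.covers[a].append(Ball(x, ball.radius / 2))
        for b in self.active(a, x):
            b.n += 1
            b.q += reward

def demo(rounds=2000, seed=1):
    rng = random.Random(seed)
    means = [lambda x: x, lambda x: 1.0 - x]   # hypothetical mean rewards
    learner = CMABFAS(2, lambda x, y: abs(x - y), rounds, x0=0.5)
    for _ in range(rounds):
        x = rng.random()
        a, ball = learner.select(x)
        reward = 1.0 if rng.random() < means[a](x) else 0.0
        learner.update(x, a, ball, reward)
    return learner

learner = demo()
```

A linear scan over all balls is used here for brevity; a space-partitioning
structure would replace `active` in a version meant for heavy load.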

⁴ (continued) increments together with the union bound, one can show that

$$P\Bigl( \frac{1}{n} \sum_{s \in S_n} \bigl[ r_a(x_s) - \mu_a(x_s) \bigr] \le c \cdot \sqrt{\log T / n} \Bigr) > 1 - \ldots$$

Using the Lipschitz assumption (2) together with the fact that both $x_t$ and $x_s$, $\forall s$,
lie in the same ball $B_i^a$ with radius $r(B_i^a)$ we then have that

$$|\mu_a(x_s) - \mu_a(x_t)| \le \lambda \cdot d(x_s, x_t) \le \lambda \, 2 r(B_i^a) \quad \forall s,$$

and thus $\mu_a(x_s) \le \mu_a(x_t) + \lambda \, 2 r(B_i^a)$. Substituting $\mu_a(x_s)$ accordingly gives us a
lower bound for the left side inside $P(\cdot)$, and, noting that
$\frac{1}{n} \sum_{s \in S_n} r_a(x_s) = Q_t(B_i^a) / n_t(B_i^a)$, we obtain as claimed, with high probability,

$$\frac{Q_t(B_i^a)}{n_t(B_i^a)} - \mu_a(x_t) \le \mathrm{conf}_t(B_i^a) + \mathrm{size}(B_i^a).$$

4 CMABFAS for SPIT

This section describes how we can map our original problem, detecting and
preventing SPIT calls (as described in Section 2), to the general self-learning
decision-making framework CMABFAS described in the previous section. This
is done as follows:
4.1 Defining the context space

The context space $X$ is chosen to be the space of all possible VoIP calls, which
we represent by the information contained in the SIP header. Specifically, we
extract the fields source IP addresses, contact information for caller, callees and
optional vias, plus fields that have a phone-specific value such as user-agent
string, preferred codec and source port number. The SIP addresses of both
parties are further split into user and host names. The result is combined to form a
vector of 16 text strings. For example, one such call $x \in X$ is of the form

x = ["208.51.215.203", "193.22.119.20", "5838565",
     "208.51.215.203", "193.22.119.20", "5838565",
     "208.51.215.203", "87008888", "208.51.215.203",
     "87008888", "Cisco-SIPGateway/IOS-12.x", "208.51.215.203",
     "CiscoSystemsSIP-GW-UserAgent", "208.51.215.203",
     "18660 G723/8000"]

To measure distances in $X$, we define a metric over SIP headers in the following
way:

$$d(x, x') := \begin{cases} 0, & \text{if } \mathrm{count}(x, x') = 16, \\ 2^{-\mathrm{count}(x, x')}, & \text{otherwise,} \end{cases}$$

where $\mathrm{count}(x, x')$ performs a Hamming-style comparison and returns the number of
string attributes that are identical in $x$ and $x'$ (count performs a string
comparison for each attribute individually). As an example, under this metric $d$
two calls have distance $d(x, x') = 0$ if each attribute in $x$ is identical to its
counterpart in $x'$, distance $d(x, x') = \tfrac{1}{2}$ if only 1 attribute in $x$ and $x'$ agrees,
and distance $d(x, x') = 1$ if no attribute in $x$ and $x'$ agrees, that is, $x$ and $x'$
are completely different. Note that with this definition of $d$ the normalization
requirement $\mathrm{diam}(X) = 1$ is fulfilled.
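
The distance over SIP headers translates directly into code. The following
sketch assumes that a call is given as a list of strings of equal length; the
function names and the toy 16-field headers used in the usage part are
hypothetical.

```python
def count_identical(x, y):
    """Number of positions at which the two attribute vectors agree."""
    return sum(1 for u, v in zip(x, y) if u == v)

def sip_distance(x, y):
    """0 if all attributes agree, otherwise 2 ** (-number of agreeing attributes)."""
    if len(x) != len(y):
        raise ValueError("both headers must have the same number of attributes")
    k = count_identical(x, y)
    return 0.0 if k == len(x) else 2.0 ** (-k)

# Toy headers with 16 attributes each (placeholder strings, not real SIP fields).
base = ["field%d" % i for i in range(16)]
all_different = ["other%d" % i for i in range(16)]
one_shared = [base[0]] + all_different[1:]
one_changed = base[:15] + ["changed"]

d_same = sip_distance(base, base)
d_none = sip_distance(base, all_different)
d_one = sip_distance(base, one_shared)
d_close = sip_distance(base, one_changed)
```

Here `d_same` is 0, `d_none` is 1, and `d_one` is 1/2, matching the three cases
discussed above, while two headers that differ in a single attribute are at
distance $2^{-15}$.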



4.2 Defining the actions 

The action the system has to decide about consists of choosing which particular 
security test out of many possible ones to apply to a given call. We assume 
that both humans and automated bots will either pass or fail a security test 
with a certain probability. In general there will be different types of security 
tests, each of a certain difficulty and thus inducing different success 
probabilities and costs. In our experiments we simplify this setting and 
initially define two abstract security tests, which we call Type-1 and Type-2. 
For each security test and call x ∈ X we assign synthetic success probabilities, 
designed in such a way that different kinds of bots exist, each with different 
capabilities to bypass a particular security test (as will be explained in more 
detail below). Overall, the system has the following three actions at its 
disposal: 

- A1: Apply no security test and directly pass the call on to the recipient. 
- A2: Apply security test Type-1. If the caller is able to pass the test 
  successfully, forward the call to the recipient. If the caller is not able to 
  pass the test successfully in one attempt, flag the call as SPIT. 
- A3: Apply security test Type-2 and proceed as in A2. 

(a) Chosen success probabilities

              A1      A2      A3
  normal      1.0     0.9     0.8
  honeypot    1.0     0.5     0.3
  voipbot     1.0     0.3     0.5
  warvox      1.0     0.1     0.3
  spitter     1.0     0.3     0.3

(b) Resulting expected reward

              A1      A2      A3
  normal      120*    28      36
  honeypot    30      15      49*
  voipbot     30      49*     15
  warvox      30      83*     49
  spitter     30      49      83*

(c) Resulting regret

              A1      A2      A3
  normal      0*      92      84
  honeypot    19      34      0*
  voipbot     19      0*      34
  warvox      53      0*      34
  spitter     53      34      0*

Fig. 3. Reward specification (see text). The optimal action for each class is 
marked with an asterisk. 

4.3 Defining the rewards 

Defining the rewards is the one rather difficult modeling choice we face. The 
reward is a single scalar quantity that must capture the performance of the 
SPIT filter. It has to be defined in such a way that by choosing actions which 
optimize it (which is what CMABFAS does), the SPIT filter does what we, as 
its designer, want it to do. In our case here the reward has to account for two 
things: (1) did we make the right decision in letting through a call or rejecting 
it; and (2) how "expensive" were the security tests necessary to arrive at this 
decision. While we imagine that the latter can be designed by a human expert 
without too much trouble, the first item poses a serious conceptual challenge: 
whether or not a call x is SPIT is beyond the system to detect on its own and 
cannot be established during runtime. Instead it would require human feedback 
much the same as an email spam filter requires labeled data or humans moving 
suspicious email to a dedicated spam folder. However, we would rather like our 
SPIT filter to be able to detect by itself if a call is SPIT (and thus generate the 
internal reward appropriately) without relying on humans pushing a red button 
every time SPIT gets through. 

Motivated by earlier work [5], we believe that one property of calls which 
can be used in this regard is call duration (see footnote 5). The basic idea is 
that, on average, SPIT calls will tend to be of shorter duration than NON-SPIT 
calls. Based on a large data set of collected real-world call durations [6], we 
model call duration by an exponential distribution with mean 30 seconds for SPIT 
calls and mean 120 seconds for NON-SPIT calls. If, however, a call is flagged as 
SPIT, we do not observe its call duration since the call is not physically 
answered; in this case we assign it a fixed reward of +100. The rationale for 
assigning +100 is that, on average, (i) SPIT will fail the security test and 
NON-SPIT will pass it and (ii) the reward of +100 is larger than the expected 
call duration for SPIT and smaller than the expected call duration for NON-SPIT, 
thus making one of the actions A2 or A3 optimal for SPIT and action A1 always 
optimal for NON-SPIT. Finally, whenever we choose action A2 or A3 we always 
incur, regardless of the outcome, a fixed cost, which was set to 100 (that is, 
it enters the reward as -100). In summary, the reward is generated according to 
the following rule: 

- Generating the reward for applying action A1 to call x: 

  • if x ∈ SPIT, reward r_A1(x) is sampled from Exp(30) 

  • if x ∈ NON-SPIT, reward r_A1(x) is sampled from Exp(120) 

- Generating the reward for applying action A2/A3 to call x: 

  • if x passes the security test (the probability of which depends on x and A2/A3) 

    * if x ∈ SPIT, reward r_A2/A3(x) is sampled from Exp(30) minus cost_of_A2/A3 

    * if x ∈ NON-SPIT, reward r_A2/A3(x) is sampled from Exp(120) minus 
      cost_of_A2/A3 

  • else if x fails the security test 

    * reward r_A2/A3(x) := 100 minus cost_of_A2/A3 

To model whether a call x is able to pass a security test, we generate a single 
Bernoulli trial whose mean depends on x and A2/A3 (see Figure 3). 

5 Of course, other features would also be possible, e.g., amount of double-talk, 
time-to-speech etc. (see [§]), and it is an open question how to use these 
features to design rewards properly. Our modeling here should merely be seen as 
a first proof of concept. 
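
The rule above can be sketched as a small sampling routine. This is an illustrative sketch only, not the authors' simulator: the function name `sample_reward`, the string labels, and the constant names are assumptions introduced here, and Exp(m) is read as an exponential distribution with mean m, as stated in the text. The label is hidden from the learner and is used only to generate the reward.

```python
import random

MEAN_DURATION = {"SPIT": 30.0, "NON-SPIT": 120.0}  # mean call duration in seconds
FLAG_REWARD = 100.0  # fixed reward when a call is flagged as SPIT
TEST_COST = 100.0    # fixed cost of applying a security test (A2 or A3)


def sample_reward(label, action, pass_prob, rng):
    """Sample one reward for a call with the given hidden label.

    label: "SPIT" or "NON-SPIT"; action: "A1", "A2" or "A3";
    pass_prob: probability that this call passes the chosen test (unused for A1);
    rng: a random.Random instance.
    """
    # expovariate takes the rate, i.e. 1 / mean.
    duration = rng.expovariate(1.0 / MEAN_DURATION[label])
    if action == "A1":
        return duration                  # call is answered, no test applied
    if rng.random() < pass_prob:         # single Bernoulli trial for the test
        return duration - TEST_COST      # test passed, call is answered
    return FLAG_REWARD - TEST_COST       # test failed, call is flagged as SPIT
```

With these constants, a flagged call yields 100 minus the test cost regardless of its label, while an answered call yields its sampled duration, reduced by the test cost if a test was applied.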

4.4 Setting up the experiment 

Finally, we have to discuss how x is related to being SPIT or NON-SPIT. Our 
experiments are based on a dataset which is a capture of 5609 calls from a real 
network operator, obtained under a non-disclosure agreement. Neither SPIT nor 
other undesired activity was reported during this period of time. Additionally, 
we generated 2827 calls using available security testing tools in a test-bed 
environment and recorded the corresponding SIP messages. Internally, x thus 
belongs to one of 5 classes: normal, warvox (http://warvox.org), spitter 
(http://hackingvoip.com/sec_tools.html), voipbot (http://voipbot.gforge.inria.fr), 
or honeypot (http://artemisa.sourceforge.net). The first class is NON-SPIT, the 
last corresponds to unsolicited scanning activity, and the 3 remaining classes 
are different kinds of SPIT, each with a different signature. In our experiment, 
we assume that each of these classes has different capabilities of passing a 
given security test, making it necessary to combat each class with a different 
optimal action. These different capabilities are implemented by assigning (by 
hand) different success probabilities to each class for each action. The reward 
distributions that result from our choices are summarized in Figure 3. Note 
again that all these detailed mechanics are not known to the CMABFAS SPIT 
filter; the only thing the filter sees are the rewards sampled from the rule 
given above. 
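
The regret panel of Figure 3 follows directly from the expected rewards in panel (b): for each class, the regret of an action is the best expected reward for that class minus the expected reward of the action. The sketch below recomputes it; the dictionary holds the values of panel (b) as printed (before the scaling to [0, 1] used in the experiments), and the names `EXPECTED_REWARD`, `regret`, and `average_regret` are introduced here for illustration.

```python
# Expected reward per class and action, copied from Figure 3(b).
EXPECTED_REWARD = {
    "normal":   {"A1": 120, "A2": 28, "A3": 36},
    "honeypot": {"A1": 30,  "A2": 15, "A3": 49},
    "voipbot":  {"A1": 30,  "A2": 49, "A3": 15},
    "warvox":   {"A1": 30,  "A2": 83, "A3": 49},
    "spitter":  {"A1": 30,  "A2": 49, "A3": 83},
}


def regret(call_class, action):
    """Expected reward of the best action minus that of the chosen action."""
    rewards = EXPECTED_REWARD[call_class]
    return max(rewards.values()) - rewards[action]


def average_regret(decisions):
    """Regret per call (regret/t) for a sequence of (class, action) decisions."""
    return sum(regret(c, a) for c, a in decisions) / len(decisions)
```

For instance, `regret("normal", "A2")` gives 120 - 28 = 92 and `regret("warvox", "A1")` gives 83 - 30 = 53, matching panel (c).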

5 Simulation Results 

Setup To populate the context space, we first generate a corpus of 8436 SIP 
headers as described in the previous section (5609 of type normal, 870 of type 
spitter, 6 of type honeypot, 80 of type warvox, 1861 of type voipbot). Every time 
we simulate a new call, we draw a header uniformly at random from this corpus 
and present it to the CMABFAS learner. We consider three scenarios of increasing 
difficulty: one where the SPIT filter has to choose among three different actions 
A1, A2, A3, one where it has to choose among 10 different actions A1, ..., A10, 
and one where it has to choose among 50 different actions A1, ..., A50. The 
three-action scenario was described in the previous section; the other scenarios 
are obtained by simply putting more choices of security tests at the disposal of 
the SPIT filter and setting success probabilities accordingly. Note that 
increasing the number of choices makes the task of finding the best choice more 
difficult (in that it requires more exploration). The stochastic rewards are 
scaled to lie in [0, 1] and are generated as shown in Figure 3 (actions A4, ..., 
A50 are populated similarly). The Lipschitz constant A from Eq. @ is set to 1. We 
perform a total of 10 independent runs and average the results; each single run 
consists of sequentially processing 10,000,000 independent calls. 

Baseline To properly evaluate the performance of our algorithm CMABFAS, we 
define a naive baseline method which works by incrementally (but 
non-adaptively) clustering the input space X and assigning a standard MAB with 
UCB1(1.2) rule [T] to each cluster. Specifically, it works like this: let x_t be 
the current call. Find the nearest cluster according to the Hamming distance. If 
the distance to the nearest cluster is greater than some parameter max_radius 
and the number of current clusters is below some other parameter max_clusters, 
we add a new cluster and initialize its counters to zero. Otherwise we assign 
x_t to the nearest cluster with index i* and choose the action a* = argmax_a 
that maximizes the UCB1 index of cluster i*, that is, the empirical mean reward 
g_t(i*,a) / n_t(i*,a) plus an exploration bonus that depends on n_t(i*) and 
n_t(i*,a). Here g_t(i*,a) is the sum of rewards for action a, n_t(i*,a) the 
number of samples for action a, and n_t(i*) the total number of samples within 
cluster i*. Having chosen a*, we observe, as before, a reward r_{a*}(x_t), after 
which we increment g_t(i*,a*), n_t(i*,a*), and n_t(i*) accordingly. The 
hyperparameters of the algorithm, max_radius and max_clusters, were chosen by a 
coarse grid search: best performance was achieved for max_radius=6 and 
max_clusters=500 (our results also include some other combinations). 
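
The baseline described above can be sketched as follows. This is a hedged illustration, not the authors' code. Three points are assumptions made here: the exploration bonus is taken as sqrt(c * ln n_t(i*) / n_t(i*,a)) with c = 1.2, which is one common way to parameterize UCB1; each action is tried once in a cluster before the index is used; and a newly opened cluster is represented by the call that opened it, which is then assigned to it. The class and method names are hypothetical.

```python
import math


def hamming(x, y):
    """Number of attributes in which two header vectors differ."""
    return sum(a != b for a, b in zip(x, y))


class ClusteredUCB1:
    def __init__(self, n_actions, max_radius, max_clusters, c=1.2):
        self.n_actions = n_actions
        self.max_radius = max_radius
        self.max_clusters = max_clusters
        self.c = c         # ASSUMPTION: 1.2 scales the log term of the bonus
        self.centers = []  # the header that opened each cluster
        self.g = []        # g[i][a]: sum of rewards of action a in cluster i
        self.n = []        # n[i][a]: number of times a was played in cluster i

    def select(self, x):
        """Return (cluster index i*, action a*) for the current call x."""
        dists = [hamming(x, center) for center in self.centers]
        i = min(range(len(dists)), key=dists.__getitem__) if dists else None
        if i is None or (dists[i] > self.max_radius
                         and len(self.centers) < self.max_clusters):
            # Open a new cluster centred on x with all counters at zero.
            self.centers.append(tuple(x))
            self.g.append([0.0] * self.n_actions)
            self.n.append([0] * self.n_actions)
            i = len(self.centers) - 1
        for a in range(self.n_actions):
            if self.n[i][a] == 0:  # ASSUMPTION: try every action once first
                return i, a
        total = sum(self.n[i])     # n_t(i*)

        def index(a):
            mean = self.g[i][a] / self.n[i][a]
            return mean + math.sqrt(self.c * math.log(total) / self.n[i][a])

        return i, max(range(self.n_actions), key=index)

    def update(self, i, a, reward):
        """Record the observed reward for action a in cluster i."""
        self.g[i][a] += reward
        self.n[i][a] += 1
```

A simulation step would call `select(x_t)`, apply the returned action, and pass the observed reward to `update`. Because the clusters have a fixed radius and are never refined, the baseline cannot adapt its resolution to the data density, which is the non-adaptive behavior the text contrasts with CMABFAS.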

Results The resulting performance of both CMABFAS and the baseline is shown 
in Figure 5 in terms of the cumulative regret, while Table 1 shows the results 
numerically in greater depth. Figure 4 illustrates the partitioning behavior of 
CMABFAS over time. In summary, the results show that CMABFAS is about 
an order of magnitude better than the best parameter setting of the baseline. 

Fig. 4. Left: number of balls CMABFAS creates over time (3 actions). Right: 
how the minimum radius changes over time (3 actions). The x-axis of both panels 
is the number of calls t. 



The curves reflect the kind of learning behavior that we would have expected 
and which is typical for MAB algorithms of this kind: initially, the reward 
distributions are unknown and thus the algorithm has to "explore" and try out 
the various actions over different regions of the context space. Over time, and 
this happens very rapidly with CMABFAS, the granularity of the ball cover of the 
context space is refined in areas of high data density and the estimates of the 
mean reward become more accurate. This in turn makes the algorithm more 
confident about its decisions, so that it explores less. The performance 
CMABFAS reaches at the end is nearly optimal: both the regret and the number of 
mistakes approach zero (averaged over all calls, the algorithm makes the correct 
decision in more than 99.9% of the cases). We also note that CMABFAS appears to 
scale well when the number of available actions is increased (see the 10-action 
and 50-action results). Finally, recall that CMABFAS is an anytime algorithm: if 
we continued to run it and process additional calls, the error rate would 
decrease further until (asymptotically) no more mistakes are made. 



6 Conclusion 



In this paper we have undertaken first steps towards making a complex decision- 
making SPIT filter (i.e., a SPIT filter which has to choose among more than two 
alternatives without access to prior labeled data and only based on stochastic 
and sparse feedback) become fully self-learning by formulating it as a contextual 
multi-armed bandit. The simulation results are encouraging; it should be noted 
though that due to the nature of the problem (SPIT is largely hypothetical and 
barely existent nowadays but believed to be a potential threat as VoIP becomes 
more widespread in the future) our results are biased by the modeling decisions 
we had to make (e.g., setting success probabilities by hand). Nevertheless, we 
believe that this research is both highly innovative and useful and could also be 
applied to other security-related problems which can be formulated in a similar 
way. 

Table 1. Quantitative results of the SPIT filter. We compare CMABFAS with a naive baseline implementation for various 
settings of its hyperparameters (see text). The table shows the performance at three different points in time: at the beginning 
of learning (after 10,000 calls), towards the middle of learning (after 100,000 calls), and towards the end of learning 
(after 10,000,000 calls). Performance is given in terms of regret/t, where regret is the sum, over all calls, of the expected 
reward of the best action minus the expected reward of the action that our SPIT filter has chosen (thus zero regret means we 
have always chosen the best action). The column nmistakes1 shows the number of times the SPIT filter chooses an action 
which is not optimal with respect to the expected reward (an error which can mean, for example, that the SPIT filter 
correctly chose to apply a security challenge to a SPIT call but not the security challenge with the highest probability 
of success/least cost ratio). The column nmistakes2 shows the number of times the SPIT filter chooses action A0 for a SPIT 
call (i.e., fails to block SPIT) or chooses one of actions A1, ..., A50 for a NON-SPIT call (i.e., applies a security 
challenge to a NON-SPIT call). The best result for each case is marked in bold face; we can see that in terms of regret 
CMABFAS is about an order of magnitude better than the best baseline. 

Fig. 5. Comparing CMABFAS with naive clustering and standard UCB1 MAB. The 
three panels show 3, 10, and 50 available actions; the x-axis of each panel is 
the number of calls t. Lower numbers indicate better performance. 